
    Processing genome-wide association studies within a repository of heterogeneous genomic datasets

    Background: Genome-wide association studies (GWAS) are based on the observation of genome-wide sets of genetic variants, typically single-nucleotide polymorphisms (SNPs), in different individuals that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than to making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions. Results: To facilitate integrative use in practice, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets that includes several heterogeneous data types in the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multisample processing queries that respond to important biological questions. These are then made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, and epigenetic signals. Conclusions: As a result of our work on GWAS datasets, we enable (1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository and (2) their big data processing by means of the GenoMetric Query Language and associated system. Future large-scale tertiary data analysis may extensively benefit from the addition of GWAS results to inform several different downstream analysis workflows.

    Genomic data integration and user-defined sample-set extraction for population variant analysis

    Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. In particular, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.

    Exploiting ladder networks for gene expression classification

    The application of deep learning to biology is increasingly relevant but difficult; one of the main obstacles is the lack of massive amounts of training data. However, some recent applications of deep learning to the classification of labeled cancer datasets have been successful. Along this direction, in this paper we apply Ladder networks, a recent and interesting network model, to the binary cancer classification problem. Our results improve over the state of the art in deep learning and over the state of the art in conventional machine learning. Achieving such results required a careful adaptation of the available datasets and tuning of the network.

    VariantHunter: a method and tool for fast detection of emerging SARS-CoV-2 variants

    With the progression of the COVID-19 pandemic, large datasets of SARS-CoV-2 genome sequences were collected to closely monitor the evolution of the virus and identify novel variants and strains. By analyzing genome sequencing data, health authorities can 'hunt' novel emerging variants of SARS-CoV-2 as early as possible, and then monitor their evolution and spread. We designed VariantHunter, a highly flexible and user-friendly tool for systematically monitoring the evolution of SARS-CoV-2 at global and regional levels. In VariantHunter, amino acid changes are analyzed over an interval of 4 weeks in an arbitrary geographical area (continent, country, or region); for every week in the interval, the prevalence is computed and changes are ranked based on their increase or decrease in prevalence. VariantHunter supports two main types of analysis: lineage-independent and lineage-specific. The former considers all the available data and aims to discover new viral variants. The latter evaluates specific lineages/viral variants to identify novel candidate designations (sub-lineages and sub-variants). Both analyses use simple statistics and visual representations (diffusion charts and heatmaps) to track viral evolution. A dataset explorer allows users to visualize available data and refine their selection. VariantHunter is a web application free to all users. The two types of supported analysis (lineage-independent and lineage-specific) allow user-friendly monitoring of the viral evolution, empowering genomic surveillance without requiring any computational background. Database URL: http://gmql.eu/variant_hunter/
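
The weekly prevalence and ranking step described in the VariantHunter abstract can be illustrated with a minimal sketch. This is not VariantHunter's code: the function names, the input layout (one `(week_index, set_of_changes)` pair per sequence), the ranking criterion (difference in prevalence between the last and the first week), and the toy data with placeholder labels `A` and `B` are all assumptions made for illustration.

```python
from collections import defaultdict

def weekly_prevalence(sequences, n_weeks=4):
    """Return {change: [prevalence in week 0, ..., week n_weeks-1]}.

    `sequences` is an iterable of (week_index, set_of_changes) pairs,
    one pair per sequenced genome (hypothetical input layout).
    """
    totals = [0] * n_weeks                      # sequences collected per week
    counts = defaultdict(lambda: [0] * n_weeks)  # sequences carrying each change
    for week, changes in sequences:
        totals[week] += 1
        for change in changes:
            counts[change][week] += 1
    return {
        change: [c / t if t else 0.0 for c, t in zip(per_week, totals)]
        for change, per_week in counts.items()
    }

def rank_changes(prevalence):
    """Sort changes by (last-week minus first-week prevalence), largest increase first."""
    return sorted(
        prevalence.items(),
        key=lambda item: item[1][-1] - item[1][0],
        reverse=True,
    )

# Invented toy data: four sequences per week, placeholder change labels.
TOY_SEQUENCES = [
    (0, {"A"}), (0, {"A"}), (0, {"A", "B"}), (0, set()),
    (1, {"A"}), (1, {"A", "B"}), (1, {"B"}), (1, set()),
    (2, {"A", "B"}), (2, {"B"}), (2, {"B"}), (2, set()),
    (3, {"B"}), (3, {"B"}), (3, {"B"}), (3, {"A", "B"}),
]

if __name__ == "__main__":
    for change, series in rank_changes(weekly_prevalence(TOY_SEQUENCES)):
        print(change, series)
```

In the toy data, change `B` rises from 25% to 100% of weekly sequences while `A` declines, so `B` is ranked first. A real analysis would restrict the input to one geographical area and, for the lineage-specific mode, to one lineage before computing prevalence.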

    Association of COVID-19 Vaccinations With Intensive Care Unit Admissions and Outcome of Critically Ill Patients With COVID-19 Pneumonia in Lombardy, Italy

    IMPORTANCE Data on the association of COVID-19 vaccination with intensive care unit (ICU) admission and outcomes of patients with SARS-CoV-2-related pneumonia are scarce. OBJECTIVE To evaluate whether COVID-19 vaccination is associated with preventing ICU admission for COVID-19 pneumonia and to compare baseline characteristics and outcomes of vaccinated and unvaccinated patients admitted to an ICU. DESIGN, SETTING, AND PARTICIPANTS This retrospective cohort study on regional data sets reports: (1) daily number of administered vaccines and (2) data of all consecutive patients admitted to an ICU in Lombardy, Italy, from August 1 to December 15, 2021 (Delta variant predominant). Vaccinated patients received either mRNA vaccines (BNT162b2 or mRNA-1273) or adenoviral vector vaccines (ChAdOx1-S or Ad26.COV2). Incidence rate ratios (IRRs) were computed from August 1, 2021, to January 31, 2022; ICU and baseline characteristics and outcomes of vaccinated and unvaccinated patients admitted to an ICU were analyzed from August 1 to December 15, 2021. EXPOSURES COVID-19 vaccination status (no vaccination, mRNA vaccine, adenoviral vector vaccine). MAIN OUTCOMES AND MEASURES The IRR of ICU admission was evaluated, comparing vaccinated people with unvaccinated, adjusted for age and sex. The baseline characteristics at ICU admission of vaccinated and unvaccinated patients were investigated. The association between vaccination status at ICU admission and mortality at ICU and hospital discharge was also studied, adjusting for possible confounders. RESULTS Among the 10 107 674 inhabitants of Lombardy, Italy, at the time of this study, the median [IQR] age was 48 [28-64] years and 5 154 914 (51.0%) were female. Of the 7 863 417 individuals who were vaccinated (median [IQR] age: 53 [33-68] years; 4 010 343 [51.4%] female), 6 251 417 (79.5%) received an mRNA vaccine, 550 439 (7.0%) received an adenoviral vector vaccine, and 1 061 561 (13.5%) received a mix of vaccines; 4 497 875 (57.2%) were boosted. Compared with unvaccinated people, IRR of individuals who received an mRNA vaccine within 120 days from the last dose was 0.03 (95% CI, 0.03-0.04; P <.001), whereas IRR of individuals who received an adenoviral vector vaccine after 120 days was 0.21 (95% CI, 0.19-0.24; P <.001). There were 553 patients admitted to an ICU for COVID-19 pneumonia during the study period: 139 patients (25.1%) were vaccinated and 414 (74.9%) were unvaccinated. Compared with unvaccinated patients, vaccinated patients were older (median [IQR]: 72 [66-76] vs 60 [51-69] years; P <.001), primarily male individuals (110 patients [79.1%] vs 252 patients [60.9%]; P <.001), with more comorbidities (median [IQR]: 2 [1-3] vs 0 [0-1] comorbidities; P <.001) and had a higher ratio of arterial partial pressure of oxygen (PaO2) and fraction of inspiratory oxygen (FiO2) at ICU admission (median [IQR]: 138 [100-180] vs 120 [90-158] mm Hg; P =.007). Factors associated with ICU and hospital mortality were higher age, premorbid heart disease, lower PaO2/FiO2 at ICU admission, and female sex (this factor only for ICU mortality). ICU and hospital mortality were similar between vaccinated and unvaccinated patients. CONCLUSIONS AND RELEVANCE In this cohort study, mRNA and adenoviral vector vaccines were associated with significantly lower risk of ICU admission for COVID-19 pneumonia. ICU and hospital mortality were not associated with vaccinated status. These findings suggest a substantial reduction of the risk of developing COVID-19-related severe acute respiratory failure requiring ICU admission among vaccinated people.
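
As a reading aid for the statistics in the abstract above, the following sketch shows how a share and a crude (unadjusted) incidence rate ratio with a Wald-type confidence interval can be computed. The IRRs reported in the abstract are adjusted for age and sex and rely on person-time that the abstract does not give, so this sketch does not attempt to reproduce them. The function names and the toy counts in the rate-ratio example are assumptions; only the counts passed to `share` come from the abstract.

```python
import math

def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole

def crude_rate_ratio(events_a, persontime_a, events_b, persontime_b):
    """Unadjusted incidence rate ratio of group A versus group B."""
    return (events_a / persontime_a) / (events_b / persontime_b)

def wald_ci(events_a, events_b, ratio, z=1.96):
    """Approximate 95% CI using SE(log ratio) = sqrt(1/events_a + 1/events_b)."""
    se = math.sqrt(1.0 / events_a + 1.0 / events_b)
    return ratio * math.exp(-z * se), ratio * math.exp(z * se)

if __name__ == "__main__":
    # Counts taken from the abstract: ICU patients by vaccination status.
    print(round(share(139, 553), 1), round(share(414, 553), 1))
    # Counts taken from the abstract: vaccine type among vaccinated individuals.
    print(round(share(6_251_417, 7_863_417), 1))

    # Hypothetical toy counts (NOT from the study): 10 events over 1000
    # person-years in group A, 40 events over 2000 person-years in group B.
    ratio = crude_rate_ratio(10, 1000, 40, 2000)
    print(ratio, wald_ci(10, 40, ratio))
```

The first two calls recompute percentages stated in the abstract from its own counts; the rate-ratio call only demonstrates the formula on invented numbers.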

    META-BASE: a Novel Architecture for Large-Scale Genomic Metadata Integration

    The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well-recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, learning their specific metadata definitions as a result of long and tedious efforts, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and propose the resulting repository, which already integrates several important sources and is exposed by means of practical user interfaces to respond to biological researchers' needs.

    Protein-protein interaction associated disorders revealed via data integration

    Interactions between proteins are important for the majority of biological functions, and they are used very frequently by biologists and bioinformaticians to interpret experimental results in the context of biomolecular interaction networks and to test their biomedical hypotheses. Numerous protein-protein interaction (PPI) data are provided by new powerful high-throughput experimental and computational techniques; they are being collected in several different databases, which include IntAct, BioGrid, BIND, DIP, HPRD and MINT. No single database covers all interaction data, and these data generally do not contain phenotypic, functional or structural information about the interactors, which in many cases is available in other databases. Thus, to obtain widespread coverage, it is necessary to combine the data from different databases, often provided in different formats. In particular, no information is available about the association of protein-protein interactions with genetic disorders. For this purpose, we are developing a software framework to create and maintain a data warehouse that integrates information from many data sources on the basis of a conceptual data model that relates molecular entities and biomedical features. As another step, we developed an automatic association inference method, based on the transitive closure concept, and applied it to the integrated data. In particular, by leveraging protein-protein interaction data, provided by the IntAct and MINT databases, and protein-encoding gene data from the Entrez Gene database, we inferred gene interaction networks. In addition, by taking advantage of genetic disorder and phenotype data provided by the OMIM database, we inferred associations between proteins and genetic disorders and their phenotypes. Then, in order to identify genetic disorders possibly associated with protein-protein interactions, we looked for those interacting proteins that were found to be associated with the same genetic disorder. PPI data files downloaded from the MINT and IntAct databases were automatically parsed, and data on 46,154 human protein-protein interactions (out of the 254,048 protein-protein interactions involving proteins of 397 different organisms) regarding 12,178 distinct human proteins (out of the 326,766 human proteins in the data warehouse) were imported into the data warehouse. These human proteins are encoded by 11,232 different human genes. By applying the transitive closure concept, we identified 1,130 gene networks and found 1,136 human protein-protein interactions associated with 628 genetic disorders (such as: Alzheimer, Cystic fibrosis, Diabetes mellitus, Parkinson…), which are related to 86 clinical synopses and 3,481 phenotypes. It is possible to extract the interactions between proteins whose encoding genes are associated with a specific disease. These interactions will lead researchers to focus on specific proteins. A defect in one or more proteins could be altered by functional interaction with other proteins. If these relations could be found, then possibly a disease treatment strategy such as synthetic protein engineering could be applied. This hypothesis shows the importance of integrating protein-protein interaction data with genetic disorder data; such integration will help scientists understand the annotations of biological data that are distributed across different databanks.
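
The inference step described in the abstract above, deriving protein-to-disorder associations through the encoding genes and then keeping interactions whose two partners share a disorder, can be sketched as a composition of relations. This is a simplified illustration, not the authors' framework: the function names, the dictionary-of-sets representation, and the placeholder identifiers (`P1`, `G1`, `D1`, ...) are assumptions, and a single composition stands in for the general transitive closure.

```python
def compose(first, second):
    """Compose two relations stored as {source: set(targets)}.

    If a -> b is in `first` and b -> c is in `second`, the result contains a -> c.
    """
    result = {}
    for source, middles in first.items():
        targets = set()
        for middle in middles:
            targets |= second.get(middle, set())
        if targets:
            result[source] = targets
    return result

def interactions_sharing_disorder(interactions, protein_to_disorders):
    """Keep interactions whose two proteins are associated with a common disorder."""
    shared = {}
    for p1, p2 in interactions:
        common = protein_to_disorders.get(p1, set()) & protein_to_disorders.get(p2, set())
        if common:
            shared[(p1, p2)] = common
    return shared

# Placeholder data (hypothetical identifiers, not real database records).
PROTEIN_TO_GENE = {"P1": {"G1"}, "P2": {"G2"}, "P3": {"G3"}}
GENE_TO_DISORDER = {"G1": {"D1"}, "G2": {"D1", "D2"}, "G3": {"D3"}}
INTERACTIONS = [("P1", "P2"), ("P2", "P3")]

if __name__ == "__main__":
    protein_to_disorder = compose(PROTEIN_TO_GENE, GENE_TO_DISORDER)
    print(interactions_sharing_disorder(INTERACTIONS, protein_to_disorder))
```

With the placeholder data, only the interaction between `P1` and `P2` is retained, because both proteins are linked to disorder `D1` through their encoding genes.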

    A review on viral data sources and search systems for perspective mitigation of COVID-19

    With the outbreak of the COVID-19 disease, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.